Summary
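Transformer attention heads often put most of their weight on gap tokens, such as <bos> and <eos>, when they need to avoid updating a representation (1). Softmax weights must sum to 1, so a head cannot abstain outright and instead attends to tokens that carry little information.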
Details
The same behaviour appears in vision transformers on image data.
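(Miller 2023) proposes Quiet Attention, also known as softmax1, to avoid this. It adds 1 to the softmax denominator, softmax1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)), so the attention weights can all approach zero when no token is relevant.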
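A minimal NumPy sketch of softmax1, assuming the usual definition with an extra 1 in the denominator (equivalent to appending a fixed zero logit whose weight is discarded). The function name, the `axis` parameter, and the max-shift for numerical stability are illustrative choices, not taken from the cited work.

```python
import numpy as np

def softmax1(x, axis=-1):
    """Softmax with an extra 1 in the denominator (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # Shift by max(max(x), 0) so neither exp(x - m) nor exp(-m) can overflow.
    m = np.maximum(x.max(axis=axis, keepdims=True), 0.0)
    e = np.exp(x - m)
    # exp(-m) is the shifted form of the "+1" term in the denominator.
    return e / (np.exp(-m) + e.sum(axis=axis, keepdims=True))

# All scores strongly negative: the head has nothing useful to attend to.
scores = np.array([-8.0, -9.0, -7.5])
weights = softmax1(scores)
total = weights.sum()
```

With all scores strongly negative, `total` comes out far below 1, which is the "do nothing" option. An ordinary softmax would force the same weights to sum to 1.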
Softmax1 was used by (2) to modify invariant point attention (IPA).
1. Bondarenko Y, Nagel M, Blankevoort T. Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing. Advances in Neural Information Processing Systems. 2023;36:75067–96. Available from: https://papers.nips.cc/paper_files/paper/2023/hash/edbcb7583fd8921dad78adecfe06a99b-Abstract-Conference.html
2. Billera L, Oresten A, Stålmarck A, Sato K, Kaduk M, Murrell B. The Continuous Language of Protein Structure. openRxiv; 2024. Available from: https://doi.org/10.1101/2024.05.11.593685